Part 1: Model building in scikit-learn


In [1]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

"Features" are also known as predictors, inputs or attributes. The "reponse" is also known as the target, label or output.


In [3]:
# check the shapes of X and y
print(X.shape)
print(y.shape)


(150, 4)
(150,)

"Observations" are also known as samples, instances, or records.


In [5]:
# examine the first 5 rows of the feature matrix
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()


Out[5]:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

In [6]:
# examine the first 5 response values
y[:5]


Out[6]:
array([0, 0, 0, 0, 0])

In [8]:
# examine the class distribution (50 observations of each species)
pd.Series(y).value_counts()


Out[8]:
2    50
1    50
0    50
dtype: int64

In order to build a model, the features must be numeric, and every observation must have the same features in the same order.
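
A quick check confirms that both conditions hold for the iris data (a minimal sketch using the X defined above):


In [ ]:
# X is an all-numeric 2D array: one row per observation, one column per feature
print(X.dtype)
print(X.ndim)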


In [10]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)


Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
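
If a new observation has the wrong number of features, scikit-learn raises an error rather than guessing. A minimal sketch (passing 3 features instead of the 4 the model was trained on):


In [ ]:
# predicting with a 3-feature observation fails with a ValueError
try:
    knn.predict([[3, 5, 4]])
except ValueError as e:
    print(e)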


In [12]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])


Out[12]:
array([1])
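
The prediction comes back as an encoded integer; iris.target_names maps the encoding back to the species name:


In [ ]:
# look up the species name for the encoded prediction (1 is 'versicolor')
iris.target_names[knn.predict([[3, 5, 4, 2]])]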

Part 2: Representing text as numerical data


In [13]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [14]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

vect.fit(simple_train)


Out[14]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [15]:
# examine the fitted vocabulary (lowercased and sorted alphabetically)
vect.get_feature_names()


Out[15]:
['cab', 'call', 'me', 'please', 'tonight', 'you']
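
The fitted vocabulary is also available as a mapping from each token to its column index in the document-term matrix:


In [ ]:
# examine the token-to-column-index mapping learned during fitting
vect.vocabulary_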

In [16]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm


Out[16]:
<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [17]:
# convert the sparse matrix to a dense array
simple_train_dtm.toarray()


Out[17]:
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

In [18]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())


Out[18]:
   cab  call  me  please  tonight  you
0    0     1   0       0        1    1
1    1     1   1       0        0    0
2    0     1   1       2        0    0

In [19]:
# examine the sparse matrix contents
print(simple_train_dtm)


  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2

Since most documents typically use only a small subset of the words in the corpus, the resulting matrix will have many feature values that are zero (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary on the order of 100,000 unique words in total, while each individual document will use 100 to 1,000 unique words.

To be able to store such a matrix in memory, and also to speed up matrix operations, implementations typically use a sparse representation such as those available in the scipy.sparse package.
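
We can check the claim on our toy document-term matrix (a small sketch; nnz is the number of explicitly stored nonzero entries of a scipy sparse matrix):


In [ ]:
# fraction of nonzero entries: 9 stored elements out of 3 * 6 = 18 cells
n_rows, n_cols = simple_train_dtm.shape
print(simple_train_dtm.nnz / (n_rows * n_cols))


(Our toy matrix is tiny, so half of its entries are nonzero; a real corpus is far sparser.)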


In [20]:
# example text for model testing
simple_test = ["please don't call me"]

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.


In [21]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()


Out[21]:
array([[0, 1, 1, 1, 0, 0]])

Summary:

  • vect.fit(train) learns the vocabulary of the training data
  • vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
  • vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
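
As a shortcut, the first two steps can be combined: fit_transform learns the vocabulary and builds the training document-term matrix in a single call, equivalent to fit followed by transform:


In [ ]:
# equivalent shortcut: learn the vocabulary and transform the training data in one step
simple_train_dtm = vect.fit_transform(simple_train)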

Part 3: Reading a text-based dataset into pandas


In [31]:
# read file into pandas using a relative path
# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
path = 'sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [32]:
# examine the shape
sms.shape


Out[32]:
(5572, 2)

In [33]:
# examine the first 10 rows
sms.head(10)


Out[33]:
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...

In [34]:
# examine the class distribution
sms.label.value_counts()


Out[34]:
ham     4825
spam     747
Name: label, dtype: int64

In [36]:
# examine the class distribution as percentages
sms.label.value_counts() * 100 / sms.shape[0]


Out[36]:
ham     86.593683
spam    13.406317
Name: label, dtype: float64
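
Equivalently, pandas can compute the proportions directly:


In [ ]:
# normalize=True returns relative frequencies instead of raw counts
sms.label.value_counts(normalize=True)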

In [38]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({
    'ham' : 0,
    'spam': 1
})

In [39]:
# check that the conversion worked
sms.head(10)


Out[39]:
  label                                            message  label_num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...          0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0
5  spam  FreeMsg Hey there darling it's been 3 week's n...          1
6   ham  Even my brother is not like to speak with me. ...          0
7   ham  As per your request 'Melle Melle (Oru Minnamin...          0
8  spam  WINNER!! As a valued network customer you have...          1
9  spam  Had your mobile 11 months or more? U R entitle...          1

In [40]:
# define X and y from the SMS data for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)


(5572,)
(5572,)

In [42]:
# split X and y into training and testing sets (25% is held out for testing by default)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1087)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(4179,)
(1393,)
(4179,)
(1393,)

Part 4: Vectorizing our dataset


In [43]:
# instantiate the vectorizer
vect = CountVectorizer()

# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

# examine the document-term matrix
X_train_dtm


Out[43]:
<4179x7468 sparse matrix of type '<class 'numpy.int64'>'
	with 55716 stored elements in Compressed Sparse Row format>

In [44]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm


Out[44]:
<1393x7468 sparse matrix of type '<class 'numpy.int64'>'
	with 17087 stored elements in Compressed Sparse Row format>

Part 5: Building and evaluating a model


In [45]:
# import and instantiate a multinomial naive Bayes model (well suited to count features like a document-term matrix)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [46]:
# train the model using X_train_dtm (timing it)
%time nb.fit(X_train_dtm, y_train)


CPU times: user 9.03 ms, sys: 0 ns, total: 9.03 ms
Wall time: 7.44 ms
Out[46]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [47]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [48]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)


Out[48]:
0.98133524766690594

In [49]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)


Out[49]:
array([[1200,    4],
       [  22,  167]])
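
In scikit-learn's convention, the rows of the confusion matrix are the actual classes and the columns are the predicted classes, so for our 0/1 labels the layout is [[TN, FP], [FN, TP]]. A small sketch unpacking it:


In [ ]:
# unpack the binary confusion matrix: rows are actual, columns are predicted
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print(tn, fp, fn, tp)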

In [50]:
# print message text for the false positives (ham incorrectly classified as spam)
# (y_test < y_pred_class is True only where the true label is 0 and the prediction is 1)
X_test[y_test < y_pred_class]


Out[50]:
4862                    Nokia phone is lovly..
4382    Mathews or tait or edwards or anderson
4703                                Anytime...
4702                    I liked the new mobile
Name: message, dtype: object

In [51]:
# print message text for the false negatives (spam incorrectly classified as ham)
# (y_test > y_pred_class is True only where the true label is 1 and the prediction is 0)
X_test[y_test > y_pred_class]


Out[51]:
1875    Would you like to see my XXX pics they are so ...
2003    TheMob>Yo yo yo-Here comes a new selection of ...
5370    dating:i have had two of these. Only started a...
5       FreeMsg Hey there darling it's been 3 week's n...
1458    CLAIRE here am havin borin time & am now alone...
2774    How come it takes so little time for a child w...
1269    Can U get 2 phone NOW? I wanna chat 2 set up m...
955             Filthy stories and GIRLS waiting for your
3425    Am new 2 club & dont fink we met yet Will B gr...
4256    Block Breaker now comes in deluxe format with ...
3419    LIFE has never been this much fun and great un...
3864    Oh my god! I've found your number again! I'm s...
3991    (Bank of Granite issues Strong-Buy) EXPLOSIVE ...
5037    You won't believe it but it's true. It's Incre...
1430    For sale - arsenal dartboard. Good condition b...
3360    Sorry I missed your call let's talk when you h...
4514    Money i have won wining number 946 wot do i do...
3460    Not heard from U4 a while. Call me now am here...
3501    Dorothy@kiefer.com (Bank of Granite issues Str...
1137    Dont forget you can place as many FREE Request...
684     Hi I'm sue. I am 20 years old and work as a la...
2823    ROMCAPspam Everyone around should be respondin...
Name: message, dtype: object

In [52]:
# examine one false negative in full (5037 is the message's index label)
X_test[5037]


Out[52]:
"You won't believe it but it's true. It's Incredible Txts! Reply G now to learn truly amazing things that will blow your mind. From O2FWD only 18p/txt"

In [53]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob


Out[53]:
array([  4.55405595e-06,   1.79459488e-16,   9.86454990e-08, ...,
         9.80960813e-10,   2.01202626e-04,   1.03984579e-06])
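
"Poorly calibrated" means that naive Bayes pushes its probabilities toward 0 and 1, so they should not be read as literal spam probabilities, although their ranking is still useful for computing AUC. One way to inspect this (a sketch using scikit-learn's calibration_curve, which bins the predicted probabilities and compares each bin's mean prediction to the observed spam fraction):


In [ ]:
# compare mean predicted probability to the actual spam fraction within each bin
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_test, y_pred_prob, n_bins=10)
print(pd.DataFrame({'mean predicted': prob_pred, 'actual fraction': prob_true}))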

In [54]:
# calculate AUC (area under the ROC curve), which depends only on the ranking of the predicted probabilities
metrics.roc_auc_score(y_test, y_pred_prob)


Out[54]:
0.96630719471251036

Part 6: Comparing models

We will compare multinomial naive Bayes with logistic regression.


In [55]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
lreg = LogisticRegression()

In [57]:
# train the model using X_train_dtm
%time lreg.fit(X_train_dtm, y_train)


CPU times: user 156 ms, sys: 15.8 ms, total: 172 ms
Wall time: 158 ms
Out[57]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [59]:
# make class predictions using X_test_dtm
y_pred_class = lreg.predict(X_test_dtm)

In [60]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = lreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob


Out[60]:
array([ 0.00365935,  0.00131051,  0.00370401, ...,  0.00567854,
        0.01299843,  0.01867533])

In [61]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)


Out[61]:
0.97631012203876522

In [62]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)


Out[62]:
0.9841181950816501
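
On this particular split, multinomial naive Bayes had the higher accuracy (0.981 vs. 0.976), while logistic regression had the higher AUC (0.984 vs. 0.966), so which model "wins" depends on the metric you care about.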

Part 7: Examining a model for further insight


In [ ]:
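# One way to examine the fitted naive Bayes model is through the per-class
# token counts it stores during fitting. This is only a sketch of what this
# part could cover: feature_count_ and class_count_ are MultinomialNB
# attributes, but the spam-to-ham ratio heuristic below is illustrative.
tokens = vect.get_feature_names()

# nb.feature_count_ holds per-class token counts: row 0 = ham, row 1 = spam
ham_counts, spam_counts = nb.feature_count_

# add 1 to avoid division by zero, then convert counts to per-message frequencies
ham_freq = (ham_counts + 1) / nb.class_count_[0]
spam_freq = (spam_counts + 1) / nb.class_count_[1]

# rank tokens by spam-to-ham frequency ratio (higher = more spam-indicative)
spam_ratio = pd.Series(spam_freq / ham_freq, index=tokens)
spam_ratio.sort_values(ascending=False).head(10)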